ABSTRACT

An experiment was designed that varied cutting score
procedures, instructions, and types of judges in order to address the
following questions concerning the Real Estate Licensing Examination:
(1) Will the cutting score levels produced by groups of judges from
differing backgrounds (academicians vs. practitioners vs. lawyers)
using the same method and instructions be different? (2) Will the
agreement between item rating profiles vary across these different
groups of judges? (3) Does either agreement across items and/or
levels vary systematically by instruction/method? It was found that
three out of four groups of judges arrived at significantly higher
cutting score levels using the Angoff method than when using the
Nedelsky procedure. The Angoff procedure was more effective in
setting standards that distinguished the minimally qualified
practitioner from the individual with average qualifications.
Although the Angoff method demonstrated somewhat higher interjudge
agreement with respect to the patterns of item responses, the average
correlation between item profiles was generally low for both
procedures. (Author/GK)










An Empirical Comparison of Judgmental Approaches 
to Standard Setting Procedures 



D. A. Rock, E. L. Davis and C. Werts 



Educational Testing Service 
Princeton, New Jersey 

June 1980 



Copyright 1980 by Educational Testing Service. All rights reserved.



ABSTRACT

An experiment was designed that varied cutting score procedures,
instructions, and types of judges in order to address the following
questions: (1) Will the cutting score levels produced by groups of
judges from differing backgrounds using the same method and instructions
be different? (2) Will the agreement between item rating profiles
vary across these different groups of judges? (3) Does either
agreement across items and/or levels vary systematically by instruction
or method?

It was found that three out of four groups of judges arrived at
significantly higher cutting score levels using the Angoff method
than when using the Nedelsky procedure. The Angoff procedure was
more effective in setting standards that distinguished the minimally
qualified practitioner from the individual with average qualifications.
Although the Angoff method demonstrated somewhat higher interjudge
agreement with respect to the patterns of item responses, the average
correlation between item profiles was generally low for both procedures.



INTRODUCTION 

The interpretation of test scores with respect to absolute
standards rather than to the performance of others has become an
increasingly important practice with the advent of evaluation concepts
such as minimal competence and mastery-non-mastery. The literature
refers to testing decisions which relate an individual's performance
to that of others in the same population of test takers as
norm-referenced testing, while testing decisions which relate test
performance to absolute standards are frequently referred to as
criterion-referenced testing.

While there is much discussion of criterion-referenced testing
per se in the literature (Anastasi, 1976; Millman, 1974; Popham
and Husek, 1969), there is very little information on how to set
cutting scores in criterion-referenced situations. As occupational
licensing and certification become more widespread, the development of
systematic and professionally defensible methods of setting cutting
scores becomes a necessity.

Ebel (1972) discusses a compensatory item probability method
that leads to a single passing score. The items of a test are
classified into a two-way grid with judged item importance and item
difficulty as the dimensions. A further judgment is made of the
proportion of items in each cell of the grid that must be passed by
a "minimally qualified barely passing" examinee. For each cell,
this proportion and the number of items on the test placed into
that cell are multiplied together. The sum of these products
(accumulated over all cells) is the number of items that must be
answered correctly if the test is to be passed.
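
The arithmetic of the Ebel procedure can be sketched in a few lines of Python. This is only an illustration of the sum-of-products rule described above: the function name, the grid labels, and every number in the example grid are hypothetical and do not come from Ebel (1972) or from the present study.

```python
def ebel_passing_score(grid):
    """Return the Ebel passing score for a grid of judged cells.

    `grid` maps (importance, difficulty) -> (n_items, proportion), where
    `proportion` is the judged share of the cell's items that a barely
    passing examinee must answer correctly.
    """
    total = 0.0
    for n_items, proportion in grid.values():
        if not 0.0 <= proportion <= 1.0:
            raise ValueError("proportion must lie between 0 and 1")
        total += n_items * proportion
    return total


# Hypothetical 2 x 2 grid for a 64-item test (values invented for illustration).
example_grid = {
    ("essential", "easy"): (20, 0.90),
    ("essential", "hard"): (10, 0.60),
    ("acceptable", "easy"): (20, 0.50),
    ("acceptable", "hard"): (14, 0.25),
}

ebel_score = ebel_passing_score(example_grid)
print(round(ebel_score, 1))  # 18 + 6 + 10 + 3.5 = 37.5
```

The method is compensatory in the sense that only the total matters: a shortfall in one cell can be offset by extra correct answers in another.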

Angoff (1971) gives the following item probability method:
"...ask each judge to state the probability that the minimally
acceptable person would answer each item correctly. In effect, the
judge would think of a number of minimally acceptable persons
instead of only one such person who would answer each item correctly.
The sum of these probabilities, or proportions, would then represent
the minimally acceptable score." (p. 515).
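
As a rough sketch of the Angoff computation, each judge's cutting score is the sum of that judge's item probabilities. Averaging the judges' scores into a single panel value is an assumption made here for illustration, and the function names and ratings are hypothetical.

```python
def angoff_judge_score(probabilities):
    """Sum one judge's item probabilities into that judge's cutting score."""
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie between 0 and 1")
    return sum(probabilities)


def angoff_panel_score(ratings_by_judge):
    """Average the per-judge cutting scores (an assumed pooling rule)."""
    scores = [angoff_judge_score(r) for r in ratings_by_judge]
    return sum(scores) / len(scores)


# Hypothetical ratings: two judges, four items.
example_ratings = [
    [0.80, 0.60, 0.50, 0.90],
    [0.70, 0.50, 0.40, 0.80],
]

judge_scores = [angoff_judge_score(r) for r in example_ratings]
panel_score = angoff_panel_score(example_ratings)
print([round(s, 2) for s in judge_scores], round(panel_score, 2))
```

Judges who record percentages rather than probabilities would divide each rating by 100 before summing.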

A variant of this probabilistic approach was described
more than 20 years ago by Nedelsky (1954). A passing score for
multiple-choice items is constructed as follows: For each item, judges
identify those distractors that the barely passing individual should be
able to eliminate. The reciprocal of the number of remaining options
(including the keyed choice) is calculated for that item. Thus, for
a five-choice item in which two distractors were judged to be the ones
that even a barely passing student would not choose, the reciprocal
is 1/3 or .33. Assuming that the test is scored one point for each correct
answer, a "guessing score" is the sum of these reciprocals computed
for all of the items in the test. This "guessing score" can be
considered the cutting score that discriminates the minimally qualified
from the non-qualified. Obviously the so-called "guess score" is
not purely a guess score since its estimation is based on partial
knowledge and in general will lead to a cutting score substantially
above what one would arrive at using the traditional "guessing"
formula.
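
The Nedelsky rule can be illustrated with a short Python sketch. The function names and the three-item example are hypothetical; the five-choice case reproduces the 1/3 value worked in the text, and the last line shows the pure chance score for comparison.

```python
from fractions import Fraction


def nedelsky_item_value(n_options, n_eliminated):
    """Reciprocal of the options left after the judged eliminations."""
    remaining = n_options - n_eliminated
    if remaining < 1:
        raise ValueError("the keyed option can never be eliminated")
    return Fraction(1, remaining)


def nedelsky_cutting_score(n_options, eliminated_counts):
    """Sum the item reciprocals over all items rated by one judge."""
    return sum(nedelsky_item_value(n_options, e) for e in eliminated_counts)


# Worked case from the text: five choices, two distractors eliminated.
five_choice_value = nedelsky_item_value(5, 2)

# Hypothetical four-choice test of three items with 0, 1 and 3 eliminations.
small_test_score = nedelsky_cutting_score(4, [0, 1, 3])

# With no eliminations on 64 four-choice items the score is the chance score.
chance_score = nedelsky_cutting_score(4, [0] * 64)
print(five_choice_value, small_test_score, chance_score)
```

Because a judge can only raise an item's value above 1/(number of options), the Nedelsky score can never fall below the chance score.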

Andrew and Hecht (1976), in an empirical study, compared Ebel's
procedure with Nedelsky's. Specifically, the study was designed to
determine (a) whether the cutting score levels for comparable
samples of items would vary depending upon the standard setting
procedure used to establish this level and (b) whether for each of the
two standard setting procedures the cutting score for a sample of
test items would vary depending upon the group of judges used. They
found that within each of the methods there was relatively high
agreement among the groups of judges with respect to cutting score
levels. That is, both methods led to consistent estimates of a
cutting score. There were, however, considerable differences between
the methods on the absolute value of the cutting score. They
found that the Nedelsky method led to a significantly lower cutting
score than did the Ebel method.

The principles of generalizability theory were used by Brennan
and Lockwood (1979) in their study comparing the Angoff and Nedelsky
methods of setting cutting scores. They discovered greater
variability over items in the probabilities generated by the Nedelsky
procedure. Also the intrarater variation was somewhat higher for the
Nedelsky method, while the average cutting scores produced were lower.
Brennan and Lockwood also examined the specific alternatives chosen
by raters for the Nedelsky method and discovered that while raters
might agree on the number of item distractors to eliminate, they
might not agree on the specific distractors.

None of the above studies systematically evaluated the invariance
of the cutting score levels across groups of judges whose backgrounds
vary. That is, which if any of the methods yields a consistent cutting
score level across populations of judges who are characterized by diverse
sources of knowledge (e.g., academicians vs. practitioners vs. lawyers)?
If one or more of the methods is relatively invariant with respect to
generalizability of results across different populations of experts,
then the "knotty" question of who are the most appropriate groups
to make cutting score judgments becomes less critical.

Another important question on which there is little or no research
information is the relative sensitivity or discriminability of the methods
to variations in instructions. That is, if judges were asked to evaluate
items with respect to both minimally acceptable persons and persons
possessing average qualifications, the resulting two cutting scores
should be well-defined with minimum overlap. Other things being
equal, a preferred method would lead to cutting scores which would
distinguish the minimally qualified from individuals with average
qualifications.

In an effort to answer some of these questions an experiment
was designed that varied methods, instructions, and types of judges.
Angoff and Nedelsky methods were compared with respect to cutting
score levels obtained under instructions having to do with minimally
competent individuals as well as persons with average competence.
Four groups of judges characterized by three different types of
backgrounds were employed. More specifically, the research addressed
the following questions.

1. Does either method produce systematically higher cutting
scores than the other?

2. Do any groups of judges systematically set higher cutting
scores than the others?

3. How do score judgments for the minimally competent examinee
differ from judgments for an examinee with average qualifications?

4. How do score judgments for the examinee with average qualifications
compare to empirical estimates of mean scores based on pre-test
item data?

PROCEDURE 

SUBJECTS: 

The sixteen judges in the present study were members of four
standing committees used by Educational Testing Service as test
question reviewers for the Real Estate Licensing Examination. The
Minority and Sex Bias Review Committee, four judges, consists
of real estate commissioners and administrative officers of real
estate commissions. They serve as licensing officers and in some
cases are also practicing brokers. The Practicing Broker Review
Committee, three judges, includes practicing brokers who also are
state real estate commissioners. The Legal Review Committee's
four participants are either assistant attorneys general involved
with real estate commissions or act as legal counsel to a real estate
commission. The remaining group of five judges are from the Consultant
Review Committee, which is composed of professors of real estate
courses from major universities who also have served as item
writers. It was thought that these four panels of judges
represented a broad knowledge of the competencies of a real estate
salesperson and yet represented both the academic viewpoint as well
as that of the practicing brokers.

DESCRIPTION OF TASKS 

In order to examine both the Angoff and the Nedelsky method
for determining cutting scores for salespersons with minimal and
average competence, four sets of instructions, summarized below,
were developed:

INSTRUCTION #1

Under this task your judgments about the test questions
are to be made with reference to your conception of a minimally
knowledgeable salesperson. You will judge what percentage of
the salespersons in this minimally knowledgeable group would
know the answer to each question and then mark on an
accompanying coded answer sheet the percentage that comes closest to
your judgment.

INSTRUCTION #2

Under this task your judgments about the test questions
are to be made with reference to your conception of a
practicing salesperson of average knowledge. You will judge what
percentage of the salespersons in this average knowledge group
would know the answers to each question and mark on an
accompanying coded answer sheet the percentage that comes
closest to your judgment.

INSTRUCTION #3

For this task, you will inspect each item distractor
and identify those distractors which the minimally knowledgeable
salesperson should be able to eliminate. That is, you will
identify those distractors which a minimally knowledgeable
salesperson would recognize as being obviously wrong. On an
extremely easy item, this might be all the distractor options
(leaving only the keyed option). On a very difficult item,
you may feel that a minimally knowledgeable individual may not
be able to eliminate any distractor option. On your coded
answer sheet, you will circle the distractor(s) which would
be eliminated by a minimally knowledgeable salesperson.

INSTRUCTION #4

For this task, you will inspect each item distractor
and identify those distractors which the typical or average
knowledgeable salesperson would be able to eliminate. That is,
you will identify those distractors which a salesperson
possessing average knowledge would recognize as being
obviously wrong. On an extremely easy item, this might be all
distractors except the keyed response. On a very hard item,
a salesperson with average knowledge may not be able to eliminate
any distractors. On your coded answer sheet, you will circle
the distractor(s) which would be eliminated by a typical
salesperson possessing average knowledge.



Instruction #1 and Instruction #2 were the minimal and average qualification
instructions for the Angoff method and Instructions #3 and #4 were the
corresponding qualification instructions for the Nedelsky method. The
presentation of four sets of instructions, along with four specially
developed parallel forms of the Real Estate Examination, was
counterbalanced over the four groups of judges. This was done in the
following manner:

GROUP 1 (Minority and Sex Bias Review Committee)

    Instruction #4    Form 3
    Instruction #3    Form 1
    Instruction #2    Form 2
    Instruction #1    Form 4

GROUP 2 (Practicing Broker Review Committee)

    Instruction #2    Form 1
    Instruction #1    Form 3
    Instruction #4    Form 4
    Instruction #3    Form 2

GROUP 3 (Legal Review Committee)

    Instruction #3    Form 4
    Instruction #4    Form 2
    Instruction #1    Form 1
    Instruction #2    Form 3

GROUP 4 (Consultant Panel)

    Instruction #1    Form 2
    Instruction #2    Form 4
    Instruction #3    Form 3
    Instruction #4    Form 1



In this way, the experiment partially controlled for both practice
and form effects. Each of the four parallel forms consisted of 64
four-choice items.
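
The balance of this assignment can be checked mechanically. The sketch below assumes that each group's instructions and forms pair off in the order listed (first instruction with first form, and so on); that pairing, the dictionary layout, and the function name are assumptions made for illustration.

```python
# group -> [(instruction, form), ...] in order of presentation (assumed pairing)
design = {
    1: [(4, 3), (3, 1), (2, 2), (1, 4)],
    2: [(2, 1), (1, 3), (4, 4), (3, 2)],
    3: [(3, 4), (4, 2), (1, 1), (2, 3)],
    4: [(1, 2), (2, 4), (3, 3), (4, 1)],
}


def is_counterbalanced(design):
    """Check the Latin-square style balance of instructions and forms."""
    levels = {1, 2, 3, 4}
    # Every group sees each instruction and each form exactly once.
    for sequence in design.values():
        if {i for i, _ in sequence} != levels:
            return False
        if {f for _, f in sequence} != levels:
            return False
    # At each serial position, each instruction and form occurs once across groups.
    for position in range(4):
        if {seq[position][0] for seq in design.values()} != levels:
            return False
        if {seq[position][1] for seq in design.values()} != levels:
            return False
    # Every instruction is paired with every form exactly once.
    pairs = {pair for seq in design.values() for pair in seq}
    return len(pairs) == 16


print(is_counterbalanced(design))  # True under the assumed pairing
```

Under that reading, order of presentation, instruction, and form are mutually balanced across the four groups, although each is still confounded with particular groups, which is consistent with the control being described as partial.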

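The structural model for the experimental design was:

$$
Y_{ijkm} = \mu + \alpha_i + \beta_j + \gamma_k + \pi_{m(i)}
         + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik}
$$
$$
\qquad\quad + (\beta\gamma)_{jk} + (\beta\pi)_{jm(i)} + (\gamma\pi)_{km(i)}
         + (\alpha\beta\gamma)_{ijk}
$$

Where: $Y_{ijkm}$ = cutting score for the mth judge in the ith group under
the jth instruction and kth method.

$\alpha_i$ = group (i = 1, ..., 4)
$\beta_j$ = instruction (j = 1, 2)
$\gamma_k$ = method (k = 1, 2)

$\pi_{m(i)}$ = judges nested within the ith group.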



RESULTS 



The experimental effects in the repeated measures design were estimated
using least squares procedures. Table 1 gives the means for the four groups
of judges, two methods, and the skill level instructions. Table 2 presents
the analysis of variance of the cutting score levels. The reader will
note that while a significant group main effect was observed, there was
also a significant group X method interaction. As can be seen by looking
at the mean cutting scores presented in Table 1, the Angoff method
consistently resulted in the setting of higher cutting scores. Also the cutting
score for the average competence instruction was higher than the minimally
competent instruction for both methods across all groups. Regardless of
the group of judges, the Angoff method produced the smallest variation
between cutting score levels across both method and instruction. Figure
1 presents a plot of the means for each method for each of the four groups.
The interaction is disordinal; that is, all groups with the exception of
Group 3 (the Legal Review Committee) obtained considerably higher cutting
scores for the Angoff procedure. It is also interesting to note that the
judges with an academic background (Group 4) had the largest method effect.

In Table 2 the significant main effect for instruction and the lack of
a statistically significant interaction between instruction and method
suggest that both methods are capable of yielding cutting scores that
discriminate the minimally qualified from individuals of average
qualifications. However, a closer inspection of the data indicates
that the interaction between methods and instructions fell just short
of significance (p = .06). A comparison of the spread between the
mean cutting scores for minimally qualified individuals and those who
have average qualifications for the two methods indicates that the
Angoff method appears to be somewhat more discriminating. That is, the
Angoff method yielded cutting score means of 39.18 and 47.10 for the
minimally qualified individuals and individuals possessing average
qualifications, respectively. The comparable figures for the Nedelsky
method were 31.00 and 34.56.

Table 1

Group Mean Cutting Scores by Instructions and Methods

                  Angoff                                  Nedelsky
          Minimal           Average           Minimal           Average
Group*  Score  Percent    Score  Percent    Score  Percent    Score  Percent
               of Total          of Total          of Total          of Total

1       40.085   62.6     44.972   70.3     28.312   44.2     32.250   50.4
2       35.013   54.7     41.936   65.5     27.639   43.2     31.833   49.7
3       38.155   59.6     48.445   75.7     40.542   63.3     45.854   71.6
4       43.484   67.9     53.035   82.9     27.517   43.0     28.304   44.2

*Group 1 consisted of four members of the Minority and Sex Bias Review Committee.
Group 2 consisted of three members of the Practicing Broker Review Committee.
Group 3 consisted of four members of the Legal Review Committee.
Group 4 consisted of five members of the Consultant Review Committee.

Table 2

Analysis of Variance Table for the Cutting Scores

Source                              Degrees of Freedom      F Ratio

Between                                     16
  Grand Mean                                 1             1114.1147*
  Error                                     15
  Group                                      3                4.4268**
  Error                                     12

Within                                      48
  Instruction                                1               41.1110*
  Error                                     15
  Instruction X Group                        3                0.6257
  Error                                     12
  Method                                     1               18.1910*
  Error                                     15
  Method X Group                             3                5.4313*
  Error                                     12
  Instruction X Method                       1                4.1533
  Error                                     15
  Instruction X Method X Group               3                0.5276
  Error                                     12

* p < .01
** p < .05
With respect to cutting score level, the results suggest that item
probability judgments from most populations of experts will give
significantly higher cutting score levels when using the Angoff method than
when using the Nedelsky method. Although both methods appear to be able
to yield cutting scores which discriminate the "idealized" individual
having minimal qualifications from one with average qualifications, the
Angoff method seems to be somewhat more discriminating than the
Nedelsky method.

In order to investigate levels of agreement, inter-judge correlations
across their item judgments were computed, transformed using Fisher's
r to z, and then averaged within the cells of the original design. High
correlations between pairs of judges within the same cell indicate that
the vector profiles generated by their respective judgments on the same
set of items are similar. It would seem that preferred methods would
demonstrate both a higher inter-judge agreement with respect to item
judgments as well as greater consistency with respect to cutting score
level both within and across populations.
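
A minimal sketch of this averaging step follows. It assumes that the mean z was transformed back to the r scale before being reported, which the text does not state; the function names and the example ratings are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two judges' item rating vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_correlation(correlations):
    """Average correlations on Fisher's z scale, then back-transform to r."""
    # atanh is Fisher's r-to-z; it is undefined for r = +1 or -1.
    zs = [math.atanh(r) for r in correlations]
    return math.tanh(sum(zs) / len(zs))


def average_interjudge_r(ratings_by_judge):
    """Mean correlation over all pairs of judges within one design cell."""
    pairs = combinations(ratings_by_judge, 2)
    return mean_correlation([pearson(a, b) for a, b in pairs])


# Hypothetical cell: three judges rating the same five items.
example_cell = [
    [0.80, 0.60, 0.50, 0.90, 0.70],
    [0.70, 0.65, 0.40, 0.80, 0.50],
    [0.60, 0.70, 0.55, 0.75, 0.80],
]
cell_agreement = average_interjudge_r(example_cell)
print(round(cell_agreement, 3))
```

Averaging on the z scale rather than averaging raw correlations reduces the bias that comes from the skewed sampling distribution of r.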

Although it is tempting to use the transformed correlations as
dependent variables in the previous analysis of variance design, this
would leave the unsolved problem of how to determine the appropriate
degrees of freedom for error terms as well as an acceptable method
for correcting the varying dependencies among the within cell
correlations. However, a simple comparison of the Angoff and
Nedelsky methods with respect to their average intercorrelations
indicates that there was somewhat greater inter-judge agreement in item
profiles for the Angoff method (r = .28) than for the Nedelsky
method (r = .13).

Although the average intercorrelation is virtually equal for
the minimum and average instructions (r = .214 and .215 respectively),
there appears to be an interaction with method. That is, the Angoff
method yielded average intercorrelations of .32 and .24 for minimum and
average qualifications while the corresponding figures for Nedelsky
were .11 and .19. Although the agreement was generally low for both
methods, it appears that there was somewhat more agreement under the
Angoff procedure. The differences in correlations do not appear to be
the result of systematically smaller within-judge variance across
items for either method. That is, there was no systematic difference
in the range of item ratings the judges gave items under the different
methods.

Group membership and inter-judge agreement also showed some
interesting relationships. Group 3 (the Legal Committee) and Group 4
(professors of real estate) demonstrated higher within group agreement
regardless of method and instruction (r = .33 and .29 respectively) than
did either the Sex and Minority Group (r = .09) or the Practicing
Brokers Group (r = .15). The lawyers appear to be more consistent with
respect to both cutting score level and inter-judge agreement across
methods.

The academicians (Group 4) were characterized by the least stability
in cutting score levels across methods, yet they demonstrated almost as
much inter-judge agreement within method as did the lawyers.

The Angoff method judgments for individuals with average qualifications
yielded a cutting score that was somewhat less than the estimated average
score for the applicant population (47.10 versus 51.99). The parallel
estimate using the Nedelsky method was considerably lower than the
estimated applicant mean score (34.56 versus 51.99). This estimate of
the applicant population mean score was based on item pretest data.

The Nedelsky derived cutting score levels are sufficiently low that
one must question their usefulness in practical situations except as a
prescreening device rather than a final or sole criterion for licensing.
Knowledge of less than half of the information judged as relevant to
performing an occupation does not seem to be a sufficiently rigorous
criterion for licensing. The Angoff cutting score seems to be somewhat
closer to the "mark" in that when considering an individual with average
knowledge the judges arrived at a cutting score much closer to the mean
score for the applicant population.

DISCUSSION

Certain of the results confirm the findings of the Andrew and
Hecht (1976) and the Brennan and Lockwood (1979) studies. In particular,
it was found that item probability methods based on the Angoff procedure
tended to yield significantly higher cutting scores than the procedure
outlined by Nedelsky. These findings applied to both the minimal and
average qualification instructions. In addition, this study indicated
that the Angoff method showed somewhat higher inter-judge agreement and
was better able to define cutting scores with less overlap when judging
on the basis of minimally qualified individuals vs. those with average
qualifications.

The question arises: Why or how did the group of lawyers manage
to arrive at the same cutting score estimate for both the Angoff and
Nedelsky methods? One possibility is an experimenter effect. That
is, in any field experiment with human subjects there is a possibility
that the subjects or some class of subjects may consciously or unconsciously
perceive that a positive goal of their task would be to orient their
behavior to bring about what they see as consistent results. In fact,
in the case of lawyers, their training and experience may encourage
this sort of need for consistent answers regardless of the path taken
to arrive at the answer.

Observations made during the experiment suggest that the short
training session with examples which was offered before the experiment
began may not have been sufficiently comprehensive for a complete
understanding of the tasks by all group members. Questions from participants
indicated that they found the Nedelsky task far more difficult to carry
out. It is felt that this possibly incomplete and differential
understanding of the Nedelsky tasks by some participants contributed to the
lower level of agreement than was found in the Angoff tasks. The
difficulty of the Nedelsky task for some participants was underscored
by the fact that on the average it took twice as long to complete as the
Angoff method.

CONCLUSIONS

Three out of four groups of judges arrived at significantly higher
cutting score levels using the Angoff method than when using the
Nedelsky procedure. The Angoff procedure was more effective in setting
cutting score levels that distinguished the minimally qualified
practitioner from the individual with average qualifications. Although
inter-judge agreement with respect to the pattern of item responses
was generally low for both procedures, the Angoff method demonstrated
somewhat higher agreement.